Intro

Juan Nunez-Iglesias
Victorian Life Sciences Computation Initiative (VLSCI)
University of Melbourne

Quick example: gene expression, without numpy

Cell type A Cell type B Cell type C Cell type D
Gene 0 100 200 50 400
Gene 1 50 0 0 100
Gene 2 350 100 50 200

In [ ]:
gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]

Why is this a bad idea?

Now with NumPy


In [ ]:
import numpy as np
a = np.array(expression_data)
print(a)

We are going to:

  • Obtain an RPKM expression matrix
  • Quantile normalize the data

using the awesome power of NumPy

Inside a numpy ndarray


In [ ]:
def print_info(a):
    print('number of elements:', a.size)
    print('number of dimensions:', a.ndim)
    print('shape:', a.shape)
    print('data type:', a.dtype)
    print('strides:', a.strides)
    print('flags:')
    print(a.flags)
    
print_info(a)

In [ ]:
print(a.data)

In [ ]:
abytes = a.ravel().view(dtype=np.uint8)

In [ ]:
print_info(abytes)

In [ ]:
print(abytes[:24])

Example: take the transpose of a


In [ ]:
print_info(a)

In [ ]:
print_info(a.T)

Example: skipping rows and columns with slicing


In [ ]:
print_info(a.T)

In [ ]:
print_info(a.T[::2])

In [ ]:
print_info(a.T[::2, ::2])

Getting a copy


In [ ]:
b = a

In [ ]:
print(b)

In [ ]:
a[0, 0] = 5
print(b)
a[0, 0] = 100

Advanced operations: axis-wise evaluation


In [ ]:
expr = np.load('expr.npy')

In [ ]:
print_info(expr)

This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the library size, which is the total number of reads across a column.

The np.sum function returns the sum of all the elements of an array. With the axis argument, you can take the sum along the given axis.


In [ ]:
lib_size = np.sum(expr, axis=0)

Exercise

Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.


In [ ]:

Advanced operations: broadcasting

In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use np.newaxis to match the dimensions.


In [ ]:
print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)

However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:


In [ ]:
np.all(expr / lib_size ==
       expr / lib_size[np.newaxis, :])

In [ ]:
expr_lib = expr / lib_size

We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).


In [ ]:
expr_lib *= 1e6

Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).


In [ ]:
gene_len = np.load('gene-lens.npy')
print(gene_len.shape)

Exercise: broadcast expr_lib and gene_len together to produce RPKM


In [ ]:
rpkm = expr_lib  # FIX THIS

In [ ]:
from matplotlib import pyplot as plt
from scipy import stats

def plot_col_density(data, xlim=None, *args, **kwargs):
    # Use gaussian smoothing to estimate the density
    density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]
    if xlim is not None:
        m, M = xlim
    else:
        m, M = np.min(data), np.max(data)
    x = np.linspace(m, M, 100)

    plt.figure()
    for density in density_per_col:
        plt.plot(x, density(x), *args, **kwargs)
    plt.xlabel('log-counts')
    plt.ylabel('frequency')
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()

In [ ]:
%matplotlib inline

In [ ]:
plt.style.use('ggplot')

In [ ]:
plot_col_density(np.log(expr+1))

In [ ]:
plot_col_density(np.log(rpkm + 1), xlim=(0, 250))

Exercise: 3D broadcasting

Below, produce the array containing the sum of every element in x with every element in y


In [ ]:
x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x # FIX THIS

Exercise: explicit broadcasting and stride tricks

Use np.broadcast_arrays to get the same-shape arrays that numpy adds together. Then use print_info on the output. Notice anything weird?


In [ ]:

Stride tricks

By manipulating the shape and strides of an array, we can produce a "virtual" array much bigger than its memory usage:


In [ ]:
def repeat(arr, n):
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=(n,) + arr.shape,
                                           strides=(0,) + arr.strides)

In [ ]:
repeat(np.random.rand(5), 4)

Exercise: np.lib.stride_tricks.as_strided

Use as_strided to produce a sliding-window view of a 1D array.


In [ ]:
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`
    
    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.
        
    Returns
    -------
    arr_slide : 2D array, shape (N - size - 1, size)
        The sliding windows of size `size` of `arr`.
        
    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    return arr  # fix this

In [ ]:
# test your code here
sliding_window(np.arange(8), 3)

Fancy indexing

You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.


In [ ]:
values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])

Exercise: quantile normalization

Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.

Hint: look for documentation for scipy.mstats.rankdata, np.sort, and np.argsort.


In [ ]:
def qnorm(x):
    """Quantile normalize an input matrix.
    
    Parameters
    ----------
    x : 2D array of float, shape (M, N)
        The input data, with each column being a
        distribution to normalize.
        
    Returns
    -------
    xn : 2D array of float, shape (M, N)
        The normalized data.
    """
    xn = np.copy(x) # replace this by normalizing code
    return xn

In [ ]:
logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)

In [ ]:
logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)

In [ ]:
plot_col_density(logexprn)

In [ ]:
plot_col_density(logrpkmn, xlim=(0, 0.25))

Advanced exercise Jack's dilemma

(If time permits.)

Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
Subject: Numpy Advanced Indexing Question

Greetings,

I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.

-- Jack


In [ ]:
# "data"

ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))

In [ ]:

Even more advanced: NumPy Array Interface

An author of a foreign package (included with the exercizes as problems/mutable_str.py) provides a string class that allocates its own memory:

ipython
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print s
abcde

You'd like to view these mutable (mutable means the ability to modify in place) strings as ndarrays, in order to manipulate the underlying memory.

Add an array_interface dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator +=).

Then print the original string to ensure that its value was modified.

Hint: Documentation for NumPy's __array_interface__ may be found in the online docs.

Here's a skeleton outline:


In [ ]:
import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---

# Create an array interface to this foreign object
s.__array_interface__ = {'data' : (XXX, False), # (ptr, is read_only?)
                         'shape' : XXX,
                         'typestr' : '|u1', # typecode unsigned character
                         }

# --- EDIT THIS SECTION ---

print 'String before converting to array:', s
sa = np.asarray(s)

print 'String after converting to array:', sa

sa += 2
print 'String after adding "2" to array:', s

In [ ]: